04:24
2026-06-21
thewatershed.markpesce.com
ai-safety
Why Evals are Hard
AI evaluations are failing as models approach general intelligence, with benchmarks saturating through contamination and Goodhart effects while the scope of evaluation expands from minutes to months. โฆ